ABSTRACT


Studying for entrance examinations can be a distressing pe- 
riod for numerous students. Consequently, many students 
decide to attend cram schools to assist them in preparing 
for these exams. For such schools and for all educational 
institutes, it is necessary to obtain the best tools to provide 
the highest quality of learning and guidance. Performance 
prediction is one tool that can serve as a resource for insights 
that are valuable to all educational stakeholders. With ac- 
curate predictions of their grades, students can be further 
guided and fostered in order to achieve their optimal learning 
goals. In this regard, we target middle school students to be 
able to guide them on their educational journey as early as 
possible. We propose a method to predict the students’ per- 
formance in entrance examinations using the comments that 
cram school teachers made throughout the lessons. Teachers 
in cram schools observe their students’ behavior closely and
give reports on the efforts taken in their subject material.
We show that the teachers’ comments can be used to construct
a tool capable of predicting students’ grades effectively.
This is a new method because previous studies
focus on predicting grades mainly using student data such 
as their reflection comments or earlier scores. Experimen- 
tal results show that using readily available feedback from 
teachers can remarkably contribute to the accuracy of stu- 
dent performance prediction. 


Keywords 
text mining, student grade prediction, teacher observation 
reports, machine learning 


1. INTRODUCTION 


“If you could reinvent higher education for the twenty-first
century, what would it look like?” A question like this one
invites many observations about the advantages and issues 
that the current state of higher education has in the world. 
As a matter of fact, this question has been addressed specif- 
ically by the founders of the Minerva Schools at KGI [1] 
in the United States. At such innovative universities and 
schools, active learning and student engagement with the
material are highly encouraged [2, 3, 4, 5, 6]. Additionally, 
the student/teacher ratio is expected to be lower than in 
traditional schools for higher teacher effectiveness [7]. Stu- 
dents are assessed and observed closely by their teachers 
and they can receive written feedback from their teachers 
daily. These reports clarify any confusion, reinforce strong 
points and give more specific advice and guidance [8, 9]. 
Moreover, since teachers frequently engage with students,
research suggests that these teachers, especially those with
professional development, can accurately judge and forecast
their students’ computational skills [10].


In this paper, we propose a novel method for predicting stu- 
dents’ performance or final grades. We show that we can 
use reports carefully written by teachers that closely observe 
the students, to construct a grade prediction model. If these 
predictions can be made accurately, it would be an invalu- 
able resource to help the teachers better regulate their stu- 
dents’ learning. Future performance prediction is considered 
a powerful means that can provide all educational stakehold- 
ers with insights that are beneficial to them. Many grade 
prediction models have been proposed by researchers in the 
last decade [11, 12, 13], but no model has used teacher re- 
ports as far as we know. The teacher reports we use are 
provided by a cram school in Japan. Cram schools are 
specialized in providing extra and more attentive education 
for students who want to achieve certain goals, particularly 
studying for high school or university entrance exams [14]. 
To capture the meanings of the teacher reports, we obtain 
vector representations by applying the term-frequency in- 
verse document-frequency (TF-IDF) method and extracting 
BERT embeddings. Our model uses these vectorized reports 
as the explanatory variables for a Gradient Boosting regres- 
sor. The regressor then predicts the students’ scores. Our 
experiment results show that when adding teachers’ reports 
to the regular student exam scores, we can predict their let- 
ter grade with an accuracy up to 62%. To sum up, our 
contributions can be outlined as follows: 


• We propose a new performance prediction method using
teacher observation reports represented with TF-IDF
and BERT.

• We built 2 main prediction models and compared their
experimental results to show that using teacher reports
has the potential to increase the accuracy of grade
prediction models.


All in all, to the best of our knowledge, this is the first 
study to use NLP to mine teacher observation comments 
to predict student grades. Our research and experimental
results demonstrate the potential that these unstructured 
teacher observation comments have in predicting students’ 
total scores and final letter grades. 


2. RELATED WORK 


The use of data mining and machine learning or deep
learning tools to construct predictive models is increasingly
being adopted in many different fields [15, 16]. Needless
to say, the educational field has not been an exception.
Topics in educational data mining vary widely from course 
recommendation systems [17] to automatic assessment [18]. 
More specifically, a large number of studies have been
dedicated to prediction modeling, whether predicting
student grades or performance, such as next-term grade
prediction [19], or student dropout. These prediction models
are essential since they underlie important educational
AI-based decision-making systems [20]. With
accurate predictions, the performance of students can be 
monitored using these systems, and students that have dif- 
ficulties in their studies can easily be detected and given 
further guidance early on. 


Over the past years, several methods have been developed 
to predict students’ performance using Natural Language
Processing (NLP) techniques. It has been shown that mining
unstructured text using NLP can improve the prediction of
students’ success beyond what is possible with the
information obtained from usual fixed-response items [21].
Luo et al. [13] proposed a method to predict student grades
based on their free-style reflection comments collected after 
each lesson. The comments were collected according to the 
PCN method [22] that categorizes the students’ comments. 
To represent the students’ reflection comments, Word2Vec 
embeddings were adopted followed by an artificial neural 
network. Their experiments show a correct rate of 80%. 
Teacher or advisor notes have been used by Jayaraman, not 
to predict student grades, but to detect students that are 
at risk of dropping out of college [23]. In their study, they 
use sentiment analysis to extract the positive and negative 
sentiment from the advisors’ notes and use those as features 
to train a model. The model achieves 73% accuracy at iden- 
tifying at-risk students. 


3. DATA DESCRIPTION 


The dataset obtained and used for our model was provided 
by a cram school in Fukuoka, Japan. To ensure confiden- 
tiality, no student names or other identifying data were pre- 
sented. Reports were obtained monthly and sent as CSV 
files. Since our model is focused on predicting the perfor- 
mance of students in their entrance examinations, we fo- 
cused on those students in their final year of middle school. 
The final dataset after preprocessing consisted of 11,960
reports over the period from May to October for 159 students.


3.1 Monthly Reports 

In addition to the student ID and the class date, each report 
also consisted of the subject code, the teacher’s comments, 
understanding, attitude and homework scores. More data 
in the reports were also provided but were unstructured and 
considered redundant for the prediction model. The fea- 
tures that were extracted from the reports and used in the 


Table 1: Number of Reports in Each Subject

                    Japanese  Math     Science  Social  English
Number of Reports   1157      3547     2428     1669    3159
                    (9.7%)    (29.6%)  (20.3%)  (14%)   (26.4%)


study are discussed in more detail in Section 4.1. However,
our main explanatory variable used in the study is the
teachers’ observation comments written in Japanese. The average
length of these comments is 96 characters. In addition, by
analyzing comments, it was observed that teachers tend to
encourage and energize their students by using words such
as “better” and “work on”. Moreover, the words used in the
comments depend on the context or class subject to some
degree. For example, the expression “calculation problem”
is likely to be used in math lessons.


In the cram school, students take different lessons for each
subject. These lessons fall under the 5 main subjects: Japanese,
Mathematics, Science, Social Studies and English. Since the
main objective of our model is to predict a student’s total
score, reports in all 5 subjects are required. Therefore,
testing the model was only possible for those students who
attended classes for all subjects. The number of reports that
fall under each subject is shown in Table 1. The values in
the table show that the most taken lessons and therefore the 
most reports provided were in the subject of Mathematics 
followed directly by English. The number of total reports 
for each student varied depending on the classes attended. 
The average number of total reports recorded for each stu- 
dent was 82 reports with a maximum of 206 and a minimum 
of 24 reports. 


3.2 Test Scores 


Students attending the cram school were naturally regis- 
tered in many different schools. The results of their regu- 
larly taken examinations at school were recorded and pro- 
vided. These scores were what we considered student data 
and would be traditionally used as the main feature to pre- 
dict their performance in the entrance exam. To teach the 
model to perform these predictions, we adopted the super- 
vised learning method. In supervised learning, training data 
needs to be labeled with the required outputs for each in- 
put. This enables the model to train its learning function by 
altering it based on the correct result so that the function 
can then be applied to new inputs. In our study, we used 
the students’ results in their cram school simulation exams 
as the labels for the model since their actual performance in 
the entrance exam was unattainable. 


The simulation scores for the 159 students were recorded for 
all subjects and also provided as the total score. To visual- 
ize the distribution of the students’ scores, histograms were 
plotted as shown in Figure 1. The distributions of the
subject scores and the total score are approximately
bell-shaped and roughly symmetric about the mean, so we
assume that the scores follow a normal distribution. The
standard deviations, σ, of all scores are displayed in
Table 2 to show how dispersed the values are.


4. METHODOLOGY 
4.1 Feature Selection 
Figure 1: Distribution of Simulation Test Scores


Table 2: Standard deviation of subject scores

     Japanese  Math   Science  Social  English | Total
σ    11.85     16.62  20.33    16.93   18.47   | 70.01


For our experimental settings, we adopt 3 main feature sets
for the sake of comparison. The first feature set, FS1,
consists of the teachers’ report contents as the main
explanatory variables. A teacher’s report evaluating the
student in one lesson consists of (1) comments, (2) an
understanding score, (3) an attitude score and (4) a homework
score. We use all of these attributes except for the homework
score, mainly because more than 36% of the reports did not
include homework scores, since not all lessons necessarily
require homework. After each lesson, the teacher writes some
comments based on their observations and assesses the student
with an understanding score of 0, 30, 60, 80 or 100 and an
attitude score of 1, 2, 3 or 4. The second feature set, FS2,
consists of student-related data only, specifically their
gender and the score of their regularly scheduled exam at
school. Since we predict each subject score separately, the
regular score corresponds to the subject score. As for the
students’ gender, the Pearson correlation coefficient between
it and the score is 0.12, while the correlation coefficient
between the regular score and the simulation score is 0.80,
which suggests that the important factor in FS2 is essentially
the student’s regular score and not the gender. Finally, we
investigate using both teachers’ reports and the regular
student scores to verify whether adding teachers’ reports
contributes to the accuracy of the prediction model. The
third feature set, FS3, is essentially a concatenation of
FS1 and FS2. A sample of FS1 is shown in Table 3.
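To make the relationship between the three feature sets concrete, the following is a minimal sketch of how one report could be turned into FS1, FS2 and FS3 vectors. The dictionary keys (`comment_vector`, `understanding`, `attitude`, `gender`, `regular_score`), the numeric gender flag and the toy values are assumptions made for illustration only; they are not taken from the dataset.

```python
from typing import Dict, List


def build_feature_sets(report: Dict, student: Dict) -> Dict[str, List[float]]:
    """Return FS1, FS2 and FS3 feature vectors for a single report.

    All field names are hypothetical; the paper does not specify a schema.
    """
    # FS1: teacher-side features (vectorized comment, understanding, attitude).
    # The homework score is left out because many reports do not include it.
    fs1 = list(report["comment_vector"]) + [
        float(report["understanding"]),
        float(report["attitude"]),
    ]
    # FS2: student-side features (gender flag and regular school exam score).
    fs2 = [float(student["gender"]), float(student["regular_score"])]
    # FS3: plain concatenation of FS1 and FS2.
    return {"FS1": fs1, "FS2": fs2, "FS3": fs1 + fs2}


# Toy values; the 3-dimensional comment vector stands in for a TF-IDF or BERT vector.
report = {"comment_vector": [0.1, 0.0, 0.7], "understanding": 80, "attitude": 4}
student = {"gender": 1, "regular_score": 72}
features = build_feature_sets(report, student)
print(features["FS3"])
```

Because FS3 is a simple concatenation, any regressor that accepts FS1 or FS2 can be trained on FS3 without further changes.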


4.2 Natural Language Processing 

There are numerous ways to represent text data for a ma- 
chine learning model to convey the original meanings of 
the text and prevent information loss. In our experiments, 
we chose to represent the teachers’ comments using two 
techniques. We used the traditional TF-IDF vectorization
method and compared it with BERT embeddings. 


4.2.1 TF-IDF 


The first essential step in transforming text into a numer- 
ical representation is preprocessing the text. This step be- 
gins with tokenization or splitting the sentences into words. 
Tokenization in languages such as English can be done by 
splitting the sentence strings at each space. However, for 
Japanese, this step is merged with the next, which is mor- 
phological analysis, since there are no spaces in Japanese 
sentences. We use the fugashi [24] parser for this step, which 
is essentially a wrapper for MeCab¹, a Japanese tokenizer
and morphological analysis tool. Our parser extracts from 
each report the following parts of speech: nouns, verbs, aux- 
iliary verbs, adjectives and adverbs. We use the correspond- 
ing terms to these extracted parts of speech to build a bag- 
of-words vector with weights given by the TF-IDF method 
implemented by sklearn [25]. Since the teachers’ comments 
are given in Japanese, we provide the mentioned parser to 
the tokenizer parameter. We also give a list of predefined 
Japanese stop words to the vectorizer. 
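The weighting step can be illustrated with a small pure-Python sketch. It assumes the comments have already been tokenized (the English tokens below are made-up stand-ins for the output of the morphological analyzer) and it uses a smoothed IDF, ln((1 + n) / (1 + df)) + 1, followed by L2 normalization, which mirrors the default settings of the sklearn vectorizer. It is not the code used in the experiments.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf_vectors(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Compute L2-normalized TF-IDF weights for pre-tokenized documents."""
    n_docs = len(docs)
    # Document frequency: number of documents in which each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in df.items()}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        weights = {t: tf[t] * idf[t] for t in tf}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors


# Hypothetical tokenized comments (content words only, stop words removed).
tokenized_comments = [
    ["calculation", "problem", "work_on"],
    ["calculation", "mistake", "careless"],
    ["vocabulary", "memorize", "work_on"],
]
vectors = tfidf_vectors(tokenized_comments)
# Terms that occur in fewer comments receive a higher weight.
print(vectors[0])
```

In the first comment, "problem" appears in only one document and therefore receives a larger weight than "calculation", which appears in two.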


4.2.2 BERT 


BERT, or Bidirectional Encoder Representations from
Transformers, is a method of pre-training language
representations presented by Google [26]. BERT obtains
state-of-the-art results on many NLP tasks. It is a Transformer
encoder stack that pre-trains language representations. A
pre-trained BERT model is basically a general-purpose
language understanding model trained on a large corpus, which
can then be used for downstream tasks. The BERT model
we used for the comments was pretrained by the Inui Laboratory,
Tohoku University². The corpus they used for pretraining
was Japanese Wikipedia, and the model was trained with
the same configuration as the original BERT. In the
experiments shown in this paper, we used the BERT [CLS] token
embeddings as our BERT embeddings.


4.3 Evaluation Metrics 

To evaluate our experiments, we use the Mean Absolute Error
(MAE) metric. The MAE is calculated using the following
formula:

$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| \mathrm{score}_{\mathrm{pred},i} - \mathrm{score}_{\mathrm{true},i} \right| \tag{1}$$


where $n$ is the number of students evaluated and
$\mathrm{score}_{\mathrm{true},i}$ is the actual score that student $i$ obtained.
The predicted score ($\mathrm{score}_{\mathrm{pred},i}$) is calculated differently for
subject scores and the total score. For a specific subject $s \in S$,
where $S$ = {Japanese, Math, Science, Social Studies, English},
a student $i$ can attend a variable number $t$ of lessons.
Therefore, to predict the subject score ($\mathrm{SubjectScore}_{\mathrm{pred},i,s}$)
of student $i$, we use each of their reports as independent
inputs to the model and obtain an ordered list $X_{i,s,t}$ of
predicted scores for student $i$. The estimated score for the
subject is then decided using:

$$\mathrm{SubjectScore}_{\mathrm{pred},i,s} = \mathrm{Med}(X_{i,s,t}) =
\begin{cases}
X_{i,s,t}\left[\frac{t+1}{2}\right], & \text{if } t \text{ is odd} \\
\frac{1}{2}\left(X_{i,s,t}\left[\frac{t}{2}\right] + X_{i,s,t}\left[\frac{t}{2}+1\right]\right), & \text{if } t \text{ is even}
\end{cases} \tag{2}$$

where $X_{i,s,t}[k]$ denotes the $k$-th element of the ordered list, counting from 1.


¹ https://taku910.github.io/mecab/#parse
² https://github.com/cl-tohoku/bert-japanese


Table 3: A sample of FS1: teachers’ reports (comments originally in Japanese)

Understanding | Attitude | Comments
80            | 4        | We are trying applied problems of resolution into factors. You look like making many mistakes carelessly, but know formulas very well.
80            | 4        | We are trying applied problems of resolution into factors. You look like making many mistakes carelessly, but know formulas very well. He took notes while watching the commentary and focused on the problem.
100           | 4        | If you keep going at this rate, you will be able to meet the target, the 5th time. So, let’s do our best!


To measure the central tendency, we used the median rather 
than the mean as it is robust to skewness and outliers. Nev- 
ertheless, if the estimations follow a normal distribution, 
the median would be close to the mean. The total predicted
score ($\mathrm{TotalScore}_{\mathrm{pred},i}$) can then be estimated by:

$$\mathrm{TotalScore}_{\mathrm{pred},i} = \sum_{s \in S} \mathrm{SubjectScore}_{\mathrm{pred},i,s} \tag{3}$$
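The aggregation described by Equations 1 to 3 can be sketched as follows. The function names and the toy per-report predictions are illustrative assumptions; in the experiments the per-report predictions would come from the trained regressors.

```python
from statistics import median
from typing import Dict, List


def subject_score(report_predictions: List[float]) -> float:
    """Equation 2: median of the per-report predictions for one subject."""
    return median(report_predictions)


def total_score(predictions_by_subject: Dict[str, List[float]]) -> float:
    """Equation 3: sum of the per-subject median predictions."""
    return sum(subject_score(p) for p in predictions_by_subject.values())


def mean_absolute_error(predicted: List[float], actual: List[float]) -> float:
    """Equation 1: average absolute difference over all students."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)


# Toy per-report predictions for one student (made-up numbers).
student_a = {
    "Japanese": [40.0, 44.0, 42.0],               # median 42.0
    "Math": [50.0, 54.0],                         # median 52.0
    "Science": [30.0],                            # median 30.0
    "Social Studies": [45.0, 35.0, 40.0, 50.0],   # median 42.5
    "English": [48.0, 52.0, 50.0],                # median 50.0
}
total_a = total_score(student_a)
print(total_a)  # 216.5

# MAE over two students, the second total being another made-up value.
mae = mean_absolute_error([total_a, 180.0], [210.0, 190.0])
print(mae)  # 8.25
```

Using the median means that a single report with an unusually high or low predicted score has little influence on the subject estimate.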


Finally, since students receive letter grades for their total 
score, we map the estimated total score to its closest cor- 
responding letter grade according to the percentages shown 
in Table 4 [27]. We then compute the percentage of grades 
that are x ticks away from their actual grades. A tick, as 
specified by [28], is defined as the difference between two 
successive letter grades. We name this metric percentage by 
tick accuracy or PTA. PTA0 stands for the Percentage by 0
Tick Accuracy, which means the model successfully predicted
the letter grade with no error, while PTA1 is the percentage
of predicted grades that are incorrect but only 1 tick away
from the true letter grade (e.g. A vs B). A similar metric was used in
previous studies regarding grade prediction models [11, 28]. 


Table 4: Letter grades and their corresponding percentages 
Grade S A B C D F 
% 90-100 80-89 70-79 60-69 50-59 0-49 
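A minimal sketch of the grade mapping and the PTA metric is given below. Two points are assumptions rather than details stated in the text: the maximum total score (`max_score=300` in the example) and the handling of fractional percentages, which are assigned here by the lower bound of each band in Table 4.

```python
from typing import List

# Lower percentage bound of each letter grade, following Table 4.
LOWER_BOUNDS = [(90, "S"), (80, "A"), (70, "B"), (60, "C"), (50, "D"), (0, "F")]
# Position of each grade on the scale; neighbouring grades are 1 tick apart.
ORDER = {g: k for k, g in enumerate(["F", "D", "C", "B", "A", "S"])}


def letter_grade(score: float, max_score: float) -> str:
    """Map a raw score to a letter grade via its percentage of max_score."""
    pct = 100.0 * score / max_score
    for bound, grade in LOWER_BOUNDS:
        if pct >= bound:
            return grade
    return "F"


def pta(pred_grades: List[str], true_grades: List[str], ticks: int) -> float:
    """Fraction of predictions exactly `ticks` grades away from the truth."""
    hits = sum(
        1 for p, t in zip(pred_grades, true_grades)
        if abs(ORDER[p] - ORDER[t]) == ticks
    )
    return hits / len(true_grades)


# Made-up predicted and true total scores, assuming a maximum of 300 points.
predicted_totals = [275.0, 236.0, 181.0, 140.0]
true_totals = [270.0, 245.0, 215.0, 200.0]
pred_grades = [letter_grade(s, 300.0) for s in predicted_totals]
true_grades = [letter_grade(s, 300.0) for s in true_totals]

print(pred_grades, true_grades)
print(pta(pred_grades, true_grades, 0), pta(pred_grades, true_grades, 1))
```

In this toy example one of the four grades is predicted exactly (PTA0 = 0.25) and two are off by a single tick (PTA1 = 0.5).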


5. EXPERIMENTS 
5.1 Model Overview 


In our experiments, we adopt gradient boosting, a composite 
machine learning algorithm. We employed its sklearn im- 
plementation, GradientBoostingRegressor [25] to predict 
the continuous value of the students’ scores in each subject. 
Since there is no prior research on the effect of using teacher 
observation reports in predicting students’ grades, we use 
the following method as the baseline in our experiment. At 
first, subject codes were unavailable for each teacher obser- 
vation record. Therefore, we constructed a model that used 
all of each student’s reports, regardless of the subject, to 
directly predict and estimate the total score according to
Equation 2. We call this model the ‘Direct’ model. Subject
codes then became accessible and we were able to map each 
report to its corresponding subject. Leveraging that, we cre- 
ated a separate regression model for each subject’s reports 
and estimated the total score as shown in Equation 3. This 
model is called the ‘Subjects’ model. 


5.2 Experimental Results 

All experiments in the study were evaluated using group 10- 
fold cross-validation. The advantage of group k-fold cross 
validation method is that all data are used for both training 
and testing, and each instance is used for testing once. This 
is especially useful in situations where data is limited. Since 
the dataset comprises reports for 159 students, we used 143 
students’ reports for each fold as the training set and 16 as 
the testing set. The number of reports or instances for each 
subject model, therefore, varied depending on how many 
lessons each student had attended. The average MAE, which 
is calculated as in Equation 1, of all ten folds was computed 
and used as the main evaluation metric. We ran the baseline 
Direct model with the 3 feature sets described in Section 4.1. 
Teachers’ comments were represented using BERT embeddings.
The performance results are shown in Table 5. Using all 3
feature sets, the Subjects model consistently outperforms the
Direct baseline model. Specifically, predicting the total
score using the Subjects model with FS3, which uses both
teachers’ reports and student data, resulted in an MAE 5.62
lower than that of the Direct model with the same feature
set (33.29 vs 38.91). Using teachers’ reports alone (FS1)
resulted in a comparatively higher MAE in both models.
However, adding
teachers’ reports to student data (FS3) showed a smaller 
value in MAE than using student data only (FS2) which 
suggests that teachers’ reports as features can contribute to 
the accuracy of the grade prediction model. 
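The grouping constraint in this evaluation, namely that all reports of a student fall on the same side of the train/test split, can be sketched in pure Python as follows. The round-robin assignment of students to folds and the toy data (two reports per student, IDs such as `s000`) are simplifying assumptions; library implementations may balance folds differently.

```python
from typing import Iterator, List, Tuple


def group_k_fold(groups: List[str], k: int) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) so that no group is split across both."""
    unique_groups = sorted(set(groups))
    # Assign each group (student) to a fold in round-robin order.
    fold_of = {g: idx % k for idx, g in enumerate(unique_groups)}
    for fold in range(k):
        test_idx = [i for i, g in enumerate(groups) if fold_of[g] == fold]
        train_idx = [i for i, g in enumerate(groups) if fold_of[g] != fold]
        yield train_idx, test_idx


# Toy data: 159 hypothetical students with two reports each.
groups = [f"s{n:03d}" for n in range(159) for _ in range(2)]
splits = list(group_k_fold(groups, k=10))

train_idx, test_idx = splits[0]
n_train_students = len({groups[i] for i in train_idx})
n_test_students = len({groups[i] for i in test_idx})
print(n_train_students, n_test_students)  # 143 16
```

With 159 students and 10 folds, nine folds hold out 16 students and one holds out 15, and every report is used for testing exactly once.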


Table 6 shows the MAE, PTA0 and PTA1 of each subject’s
score prediction model. We ran the Subjects model with all 3
feature sets. For FS1 and FS3, we compared the performance
of the two text representations, TF-IDF vectors and BERT 
embeddings. Values in bold indicate the leading scores for 
each metric in all subjects. In terms of MAE, using FS3 
consistently outperforms the other feature sets. It can also 
be seen that BERT embeddings tend to have better overall 


Table 5: Average MAE of total score prediction with the Direct
model vs the Subjects model using the 3 feature sets: FS2: student
data, FS1: teacher reports, FS3: FS1 + FS2

           FS2   | FS1   | FS3
Direct     42.73 | 53.81 | 38.91
Subjects   36.83 | 52.02 | 33.29
Table 6: Evaluation metric scores in all subjects using the 3 feature sets and comparing between using TF-IDF for text
representation vs using BERT embeddings. Values in bold indicate the best metric value in a specific subject.

           | Japanese         | Math             | Science          | Social Studies    | English          | Total
           | MAE   PTA0 PTA1  | MAE   PTA0 PTA1  | MAE   PTA0 PTA1  | MAE   PTA0 PTA1   | MAE   PTA0 PTA1  | MAE   PTA0  PTA1
       FS2 | 10.32 0.37 0.20  | 10.96 0.53 0.12  | 15.02 0.49 0.12  | 13.43 0.51 0.087  | 12.48 0.58 0.12  | 36.83 0.58  0.15
TF-IDF FS1 | 9.79  0.36 0.22  | 12.53 0.47 0.10  | 17.25 0.37 0.09  | 13.57 0.60 0.01   | 14.93 0.52 0.00  | 54.81 0.47  0.07
TF-IDF FS3 | 9.16  0.38 0.20  | 10.37 0.50 0.16  | 14.07 0.44 0.16  | 12.08 0.56 0.086  | 12.10 0.58 0.13  | 35.19 0.621 0.14
BERT   FS1 | 9.47  0.27 0.23  | 12.36 0.45 0.07  | 16.66 0.40 0.11  | 13.92 0.55 0.02   | 14.51 0.52 0.02  | 52.02 0.49  0.07
BERT   FS3 | 9.32  0.37 0.22  | 10.12 0.52 0.18  | 13.31 0.43 0.18  | 12.00 0.53 0.095  | 10.99 0.62 0.11  | 33.29 0.622 0.17
[Plot: Average MAE for Each Subject Score Prediction; x-axis: Japanese, Math, Science, Social Studies, English]

Figure 2: Average MAE of subject scores across all FS


performance than the TF-IDF vectors. Moreover, running
the Subjects model with FS1 using BERT resulted in lower
MAE than when using TF-IDF for the total score and for
every subject except Social Studies. Finally, when predicting
the total score, using FS3 with BERT held the top scores
across all evaluation metrics.


Figure 2 depicts the performance of each subject separately
in terms of MAE across the three feature sets. It can be
observed that FS3 consistently achieves lower MAE than FS2
and FS1. In addition, as shown in Figure 3, FS3 also
consistently achieves higher overall PTA. When predicting the
total score, FS3 shows an increase of 6.2 percentage points
in PTA0 + PTA1. These results suggest that teachers’
reports can in fact add value and contribute to grade
prediction models.
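The reported increase can be checked against the Total column of Table 6, assuming it compares FS3 with BERT embeddings against FS2 and is expressed as an absolute difference in PTA0 + PTA1:

```latex
\begin{aligned}
(\mathrm{PTA}_0 + \mathrm{PTA}_1)_{\mathrm{FS3}} - (\mathrm{PTA}_0 + \mathrm{PTA}_1)_{\mathrm{FS2}}
  &= (0.622 + 0.17) - (0.58 + 0.15) \\
  &= 0.792 - 0.73 \\
  &= 0.062
\end{aligned}
```

That is, about 79.2% of the total-score grades are predicted within one tick with FS3, compared with 73% with FS2.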


6. DISCUSSION 


The results presented in the previous section can be sum- 
marized into the following main points. 

• The highest performance of the grade prediction model
can be achieved by using a concatenation of the two
feature sets, FS1 and FS2.

• When predicting the total score with teachers’ reports,
using BERT embeddings outperforms TF-IDF.

The success of BERT can be attributed to the fact that
the BERT model has been pretrained on huge corpora of
Japanese text data. TF-IDF vectors, on the other hand, only
use the data on hand to produce the representations. However,
an important advantage of TF-IDF is that the numerical
vector representations are computed much faster than
extracting BERT embeddings. To further increase the
accuracy of the prediction model with FS3 and FS1, we
aim to pre-train BERT on the reports of each of the 5
subjects. It has been shown that pretraining BERT on specific
domains can lead to a significant increase in performance [29].


[Plot: Average PTA for Score Prediction; x-axis: Japanese, Math, Science, Social Studies, English, Total; series: FS2 PTA0, FS2 PTA0 + PTA1, FS3 PTA0, FS3 PTA0 + PTA1]

Figure 3: A comparison of the PTA metric evaluated when using
FS2 and FS3 across all subject scores and the total score


7. CONCLUSION 


At educational institutes where students are closely observed 
by their teachers, large amounts of unstructured data exist 
in the form of reports and comments. In this paper, we at- 
tempted to employ and take advantage of these comments 
to help identify students that may need extra guidance or 
attention. Our model used teacher observation comments 
to predict students’ total scores. We applied both TF-IDF 
and BERT embeddings to the observation comments and 
used the vectors as inputs to a gradient boosting regressor.
Three main feature sets were employed in our model:
teacher-related features, student-related features, and a
concatenation of both. The performance of our model on each
set was then demonstrated. Our experimental results showed
that readily available teachers’ reports have the potential
to serve as inputs to a grade prediction model. Adding
teachers’ reports to a grade prediction model that uses only
students’ previous exam scores increased PTA0 + PTA1 for the
total score by 6.2 percentage points. However,
there remains room for improvement in our experiments. We 
believe that with more teachers’ comments, the accuracy of 
our model could increase. We also plan to enhance the text 
representations by pretraining BERT on the teachers’ com- 
ments in advance. Additionally, we intend to experiment 
with another model architecture that would focus on clas- 
sifying the students’ performance first. We hope that with 
such well-defined grade prediction models, we can help guide
young students and provide a more focused and personalized 
education to them. 